Onodera, Naoyuki; Idomura, Yasuhiro
Lecture Notes in Computer Science 10776, p.128 - 145, 2018/00
Times Cited Count: 10, Percentile: 85.61 (Computer Science, Artificial Intelligence)

We developed a CFD code based on the adaptive-mesh-refined Lattice Boltzmann Method (AMR-LBM). The code was developed on the GPU-rich supercomputer TSUBAME3.0 at Tokyo Tech, and the GPU kernel functions were tuned to achieve high performance on the Pascal GPU architecture. Weak scaling from 1 node to 36 nodes was examined. The GPUs (NVIDIA Tesla P100) achieved more than 10 times higher node performance than the CPUs (Broadwell).
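As a rough illustration of the block-based AMR idea behind such codes, the sketch below refines a 2D quadtree of blocks wherever an error indicator exceeds a threshold. This is a minimal sketch with hypothetical names; the actual code works with 3D blocks of LBM cells on GPUs.

```python
# Sketch of block-based adaptive mesh refinement (AMR).
# Each block is (x, y, size, level); all names are illustrative.

def refine(blocks, indicator, threshold, max_level):
    """Split any block whose error indicator exceeds the threshold
    into four children at the next refinement level (2D quadtree)."""
    out = []
    for (x, y, size, level) in blocks:
        if level < max_level and indicator(x, y, size) > threshold:
            half = size / 2
            out += [(x,        y,        half, level + 1),
                    (x + half, y,        half, level + 1),
                    (x,        y + half, half, level + 1),
                    (x + half, y + half, half, level + 1)]
        else:
            out.append((x, y, size, level))
    return out

# Refine repeatedly around a point source at the origin.
indicator = lambda x, y, size: 1.0 / (1.0 + x * x + y * y)
blocks = [(0.0, 0.0, 1.0, 0)]
for _ in range(3):
    blocks = refine(blocks, indicator, threshold=0.5, max_level=3)
```

Fine blocks cluster near the origin where the indicator is large, while the rest of the domain stays coarse, which is what keeps the cell count manageable in plume-scale LBM runs.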
Shimokawabe, Takashi*; Endo, Toshio*; Onodera, Naoyuki; Aoki, Takayuki*
Proceedings of 2017 IEEE International Conference on Cluster Computing (IEEE Cluster 2017) (Internet), p.525 - 529, 2017/09
Stencil-based applications such as CFD have succeeded in obtaining high performance on GPU supercomputers. The problem sizes of these applications are limited by the GPU device memory capacity, which is typically smaller than the host memory. On GPU supercomputers, a locality-improvement technique using a temporal blocking method with memory swapping between host and device enables large computations beyond the device memory capacity. Our high-productivity stencil framework automatically applies temporal blocking to the boundary exchange required for stencil computation and supports automatic memory swapping provided by an MPI/CUDA wrapper library. A framework-based application for airflow in an urban city maintains 80% performance even for problem sizes twice as large as the GPU memory capacity, and has demonstrated good weak scalability on the TSUBAME 2.5 supercomputer.
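The temporal blocking idea can be illustrated with a toy 1D three-point stencil: by widening the halo to the temporal depth, a block can advance several time steps before the next exchange, trading redundant computation for fewer messages. This is a minimal single-process sketch under those assumptions, not the framework's actual CUDA/MPI API.

```python
# Temporal blocking for a 1D three-point stencil (illustrative).
# With a halo of depth nt, a block advances nt steps locally
# before any halo exchange is needed.

def step(u):
    # One explicit stencil update (simple 3-point average); ends held fixed.
    return [u[0]] + [(u[i - 1] + u[i] + u[i + 1]) / 3.0
                     for i in range(1, len(u) - 1)] + [u[-1]]

def advance_block(u, lo, hi, nt):
    """Advance cells [lo, hi) by nt steps using a halo of width nt,
    so only one exchange is needed per nt steps."""
    work = u[lo - nt: hi + nt]        # block plus temporal-depth halo
    for _ in range(nt):
        work = step(work)
    return work[nt: nt + (hi - lo)]   # clean interior after nt steps

u = [float(i % 7) for i in range(32)]
nt = 3
ref = u
for _ in range(nt):
    ref = step(ref)                   # nt plain global steps, for comparison
blocked = advance_block(u, lo=8, hi=16, nt=nt)
```

The blocked interior matches the plainly stepped result because the dependency cone of each retained cell stays inside the widened halo; the stale halo edges contaminate only cells that are discarded.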
Matsumoto, Kazuya; Asahi, Yuichi*; Ina, Takuya; Idomura, Yasuhiro
no journal
We present the implementation and performance evaluation of the plasma physics simulation code GT5D on a GPU cluster. In this study, an iterative matrix solver, which is identified as a performance bottleneck in the code, is tuned on the GPU. The measured performance is compared with the attainable performance estimated by the roofline model. Additionally, we show an implementation with direct communication between GPUs for utilizing many GPUs.
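The roofline comparison mentioned above follows a standard formula: attainable performance is the minimum of the peak FLOP rate and memory bandwidth times arithmetic intensity. A minimal sketch with illustrative hardware figures, not the paper's measured numbers:

```python
# Roofline model: a kernel is bound either by peak compute or by
# memory bandwidth times its arithmetic intensity (FLOP per byte).

def roofline(peak_gflops, bw_gbs, intensity_flop_per_byte):
    return min(peak_gflops, bw_gbs * intensity_flop_per_byte)

# Illustrative NVIDIA P100 figures: ~4700 GF/s FP64 peak, ~732 GB/s HBM2.
peak, bw = 4700.0, 732.0
low  = roofline(peak, bw, 0.5)    # memory-bound kernel: 366 GF/s
high = roofline(peak, bw, 10.0)   # compute-bound kernel: 4700 GF/s
```

Iterative sparse solvers typically sit far left on the roofline (low intensity), which is why memory-bandwidth tuning dominates their GPU optimization.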
Onodera, Naoyuki; Idomura, Yasuhiro
no journal
Since diffusion simulations of pollutants attract high social concern, high-precision real-time analysis is required. We developed a CFD code based on the Lattice Boltzmann Method (LBM) with Adaptive Mesh Refinement (AMR). In this presentation, we propose an optimal data structure and calculation algorithm for real-time LBM analysis.
Onodera, Naoyuki; Idomura, Yasuhiro
no journal
A real-time simulation of the environmental dynamics of radioactive substances is very important from the viewpoint of nuclear security. We developed a CFD code based on the Lattice Boltzmann Method (LBM) with a block-based Adaptive Mesh Refinement (AMR) method. The code is tuned to achieve high performance on the latest Pascal GPU architecture. By introducing a temporal blocking technique, the number of MPI communications is significantly reduced.
Ina, Takuya; Idomura, Yasuhiro; Imamura, Toshiyuki*; Yamashita, Susumu; Onodera, Naoyuki
no journal
We have developed a mixed-precision preconditioner for the preconditioned conjugate gradient (PCG) method in the multi-phase multi-component thermal-hydraulic code JUPITER. The preconditioner employs a hybrid mixed-precision approach which combines FP16 data and FP32 operations. The roundoff errors are reduced by converting FP16 data to FP32 on cache, holding the intermediate results in FP32, converting the final result to FP16, and returning it to memory. The developed preconditioner was tested on large-scale problems with 3D structured grids of 3,200 × 2,000 × 14,160. The convergence of the PCG method was maintained even when the FP16 data format was used for ill-conditioned matrices, and the computational speed was dramatically increased by the reduced memory access. The hybrid FP16/FP32 mixed-precision implementation achieved a 1.79× speedup over the FP64 implementation at 2,000 nodes on Fugaku.
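The FP16-storage / FP32-arithmetic pattern can be mimicked in pure Python with the `struct` module's half-precision format. This is a simplified sketch of the idea only; the actual preconditioner operates on sparse matrices on Fugaku's A64FX processors, and all names below are illustrative.

```python
# Hybrid FP16-storage / FP32-arithmetic sketch: data lives in FP16 to
# halve memory traffic, is promoted to FP32 for arithmetic, accumulated
# in FP32, and rounded back to FP16 once at the end.
import struct

def to_fp16(x):   # round a Python float to FP16 storage precision
    return struct.unpack('e', struct.pack('e', x))[0]

def to_fp32(x):   # round to FP32, emulating single-precision arithmetic
    return struct.unpack('f', struct.pack('f', x))[0]

# An FP16-stored vector (e.g. one preconditioner row, illustrative data).
data16 = [to_fp16(0.1 * i) for i in range(100)]

# Hybrid scheme: promote FP16 -> FP32, accumulate in FP32, store FP16 once.
acc = 0.0
for v in data16:
    acc = to_fp32(acc + to_fp32(v))
hybrid = to_fp16(acc)

# Naive scheme: round every intermediate result back to FP16.
acc = 0.0
for v in data16:
    acc = to_fp16(acc + v)
naive = to_fp16(acc)
```

Keeping the running sum in FP32 avoids the per-step FP16 rounding of the naive scheme, which is the mechanism the abstract describes for controlling roundoff while still paying only FP16 memory traffic.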